{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "OpenRec Tutorial #1.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "[View in Colaboratory](https://colab.research.google.com/github/ylongqi/openrec/blob/master/tutorials/OpenRec_Tutorial_1.ipynb)"
      ]
    },
    {
      "metadata": {
        "id": "EWey7zLleCFQ",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "\n",
        "Get Started\n",
        "---\n",
        "by *[Longqi@Cornell](http://www.cs.cornell.edu/~ylongqi)* licensed under [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)\n",
        "\n",
        "This tutorial demonstrates the process of training and evaluating recommendation algorithms using OpenRec (>=0.2.0):\n",
        "\n",
        "*   Prepare training and evaluation datasets.\n",
        "*   Instantiate samplers for training and evaluation.\n",
        "*   Instantiate a recommender.\n",
        "*   Instantiate evaluators.\n",
        "*   Instantiate a model trainer.\n",
        "*   TRAIN AND EVALUATE!"
      ]
    },
    {
      "metadata": {
        "id": "fD_18EbZe8aw",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Prepare training and evaluation datasets\n",
        "---\n",
        "*   Download your favorite dataset from the web. In this tutorial, we use [a relatively small citeulike dataset](http://www.wanghao.in/CDL.htm) for demonstration purpose."
      ]
    },
    {
      "metadata": {
        "id": "igm-6STld-bG",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "!apt-get install unrar\n",
        "!pip install openrec\n",
        "\n",
        "import os\n",
        "try:\n",
        "    from urllib.request import urlretrieve\n",
        "except ImportError:\n",
        "    from urllib import urlretrieve\n",
        "\n",
        "urlretrieve('http://www.wanghao.in/data/ctrsr_datasets.rar', 'ctrsr_datasets.rar')\n",
        "os.system('unrar x ctrsr_datasets.rar')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "_2hSL6njf6pW",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "*   Convert raw data into [numpy structured array](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.rec.html). As required by the [Dataset](https://github.com/ylongqi/openrec/blob/master/openrec/utils/dataset.py) class, two keys **user_id** and **item_id** are required. Each row in the converted numpy array represents an interaction. The array might contain additional keys based on the use cases.\n",
        "\n"
      ]
    },
    {
      "metadata": {
        "id": "Jn-syZ27fgLQ",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "import random\n",
        "\n",
        "total_users = 0\n",
        "interactions_count = 0\n",
        "with open('ctrsr_datasets/citeulike-a/users.dat', 'r') as fin:\n",
        "    for line in fin:\n",
        "        interactions_count += int(line.split()[0])\n",
        "        total_users += 1\n",
        "\n",
        "# radomly hold out an item per user for validation and testing respectively.\n",
        "val_structured_arr = np.zeros(total_users, dtype=[('user_id', np.int32), \n",
        "                                                  ('item_id', np.int32)]) \n",
        "test_structured_arr = np.zeros(total_users, dtype=[('user_id', np.int32), \n",
        "                                                   ('item_id', np.int32)])\n",
        "train_structured_arr = np.zeros(interactions_count-total_users * 2, \n",
        "                                dtype=[('user_id', np.int32), \n",
        "                                       ('item_id', np.int32)])\n",
        "\n",
        "interaction_ind = 0\n",
        "next_user_id = 0\n",
        "next_item_id = 0\n",
        "map_to_item_id = dict()  # Map item id from 0 to len(items)-1\n",
        "\n",
        "with open('ctrsr_datasets/citeulike-a/users.dat', 'r') as fin:\n",
        "    for line in fin:\n",
        "        item_list = line.split()[1:]\n",
        "        random.shuffle(item_list)\n",
        "        for ind, item in enumerate(item_list):\n",
        "            if item not in map_to_item_id:\n",
        "                map_to_item_id[item] = next_item_id\n",
        "                next_item_id += 1\n",
        "            if ind == 0:\n",
        "                val_structured_arr[next_user_id] = (next_user_id, \n",
        "                                                    map_to_item_id[item])\n",
        "            elif ind == 1:\n",
        "                test_structured_arr[next_user_id] = (next_user_id, \n",
        "                                                     map_to_item_id[item])\n",
        "            else:\n",
        "                train_structured_arr[interaction_ind] = (next_user_id, \n",
        "                                                         map_to_item_id[item])\n",
        "                interaction_ind += 1\n",
        "        next_user_id += 1"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "KOFB8DOejoIu",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "*   Instantiate training, validation, and testing datasets using the Dataset class."
      ]
    },
    {
      "metadata": {
        "id": "OM8-0XbkhG3r",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from openrec.tf1.utils import Dataset\n",
        "\n",
        "train_dataset = Dataset(raw_data=train_structured_arr,\n",
        "                        total_users=total_users, \n",
        "                        total_items=len(map_to_item_id), \n",
        "                        name='Train')\n",
        "val_dataset = Dataset(raw_data=val_structured_arr,\n",
        "                      total_users=total_users,\n",
        "                      total_items=len(map_to_item_id),\n",
        "                      num_negatives=500,\n",
        "                      name='Val')\n",
        "test_dataset = Dataset(raw_data=test_structured_arr,\n",
        "                       total_users=total_users,\n",
        "                       total_items=len(map_to_item_id),\n",
        "                       num_negatives=500,\n",
        "                       name='Test')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "q8xdoV6nkDx_",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Instantiate samplers\n",
        "---\n",
        "*  For training, **RandomPairwiseSampler** is used, i.e., each instance contains an user, an item that the user interacts, and an item that the user did NOT interact.\n",
        "*  For evaluation, **EvaluationSampler** is used. It feeds in user interaction data one user at a time. For a user, (relevant and irrelevant) items are divided into batches and evaluated seperately."
      ]
    },
    {
      "metadata": {
        "id": "qb8G4Kkx0g6Z",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from openrec.tf1.utils.samplers import RandomPairwiseSampler\n",
        "from openrec.tf1.utils.samplers import EvaluationSampler\n",
        "\n",
        "train_sampler = RandomPairwiseSampler(batch_size=1000, \n",
        "                                      dataset=train_dataset, \n",
        "                                      num_process=5)\n",
        "val_sampler = EvaluationSampler(batch_size=1000, \n",
        "                                dataset=val_dataset)\n",
        "test_sampler = EvaluationSampler(batch_size=1000, \n",
        "                                 dataset=test_dataset)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "FAAyCiTV0tik",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Instantiate a recommender\n",
        "---\n",
        "*  We use the [BPR recommender](https://github.com/ylongqi/openrec/blob/master/openrec/recommenders/bpr.py) that implements the pure Baysian Personalized Ranking (BPR) algorithm."
      ]
    },
    {
      "metadata": {
        "id": "I_ovJrcm1ADu",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from openrec.tf1.recommenders import BPR\n",
        "\n",
        "bpr_model = BPR(batch_size=1000, \n",
        "                total_users=train_dataset.total_users(), \n",
        "                total_items=train_dataset.total_items(), \n",
        "                dim_user_embed=50, \n",
        "                dim_item_embed=50, \n",
        "                save_model_dir='bpr_recommender/', \n",
        "                train=True, serve=True)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "PwvAJ-2a1Qcu",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Instantiate evaluators\n",
        "---\n",
        "*  Define evaluators that you plan to use. This tutorial evaluate the recommender against Area Under Curve (AUC).\n",
        "\n"
      ]
    },
    {
      "metadata": {
        "id": "ZFQ-P3dK1WLr",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from openrec.tf1.utils.evaluators import AUC\n",
        "\n",
        "auc_evaluator = AUC()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "cYgI7yA11YtY",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Instantiate a model trainer\n",
        "---\n",
        "*  The model trainer wraps a recommender and makes it ready for training and evaluation."
      ]
    },
    {
      "metadata": {
        "id": "JOixyvXk11pI",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from openrec import ModelTrainer\n",
        "\n",
        "model_trainer = ModelTrainer(model=bpr_model)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "mSG1oFNK17UV",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "TRAIN AND EVALUATE\n",
        "---"
      ]
    },
    {
      "metadata": {
        "id": "LCeXalVE2FAh",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "model_trainer.train(total_iter=10000,  # Total number of training iterations\n",
        "                    eval_iter=1000,    # Evaluate the model every \"eval_iter\" iterations\n",
        "                    save_iter=10000,   # Save the model every \"save_iter\" iterations\n",
        "                    train_sampler=train_sampler, \n",
        "                    eval_samplers=[val_sampler, test_sampler], \n",
        "                    evaluators=[auc_evaluator])"
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}